#Introduction ##The two datasets I have created are called “Numeric” and “Categorical.” Each of these datasets contain information about cancer in countries across the world. The data was obtained from Cancer Atlas, which gives data on cancer to the public. They provide information in each country assessing the rates of risk of cancer, the current rates of cancer in the country, and ways the countries are trying to alleviate the prevelance of cancer. In the “Numeric” dataset, there are four variables: “Country or Territory,” “Cancer Survivors” diagnosed within the last five years across both sexes per 100,000 people, “Radiotherapy availability” or the number of machines available per 1000 patients, and “UICC Organizations” that are cancer organizations. In the “Categorical” dataset, there are three variables: “Country or Territory,” “Most common cancer cases worldwide (females),” and “Most common cancer cases worldwide (males).”
#This summer, I will be matriculating in Physician Assistant school. I find cancer research extremely intersting, and hopefully can use some statistical analysis in the future to observe effective treatments and help my patients to the best of my ability. I assume that the most common cancer cases between men and women will differ in each country. Different countries have different cultures, gentic histories, etc. I am also assuming that as the amount of cancer organizations or radiotherapy options increase, the cancer survivors will also increase. More cancer resources might help increase the chances for those with cancer.
##Setup
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
install.packages("cluster")
##
## The downloaded binary packages are in
## /var/folders/tr/wscl2rq11mg7lp1sk2bksgg00000gn/T//RtmpAvQ9gC/downloaded_packages
library(tidyverse)
library(cluster)
library(readxl)
library(readr)
Categorical <- read_excel("/Users/kyliewakefield/Documents/UT/Spring 2020/Website/content/Categorical.xlsx")
Numeric <- read_excel("/Users/kyliewakefield/Documents/UT/Spring 2020/Website/content/Numeric.xlsx")
glimpse(Numeric)
## Observations: 211
## Variables: 4
## $ `Country or Territory` <chr> "Puerto Rico", "Tokelau", "Afghanistan", …
## $ `Cancer survivors` <dbl> 793.5, NA, 151.0, 312.6, 228.1, 401.1, 40…
## $ `Radiotherapy availability` <dbl> 1.35, NA, 0.00, 0.75, 0.86, 0.60, 0.34, 0…
## $ `UICC organizations` <dbl> 1, 0, 2, 13, 1, 0, 2, 0, 6, 192, 1, 2, 7,…
glimpse(Categorical)
## Observations: 211
## Variables: 3
## $ `Country or Territory` <chr> "Puerto Rico", "Tokel…
## $ `Most common cancer cases worldwide (females)` <chr> "Breast", NA, "Breast…
## $ `Most common cancer cases worldwide (males)` <chr> "Prostate", NA, "Lip,…
##Joining/Merging
# Joining the two datasets by 'Country'. After ommitting the
# NAs, some country rows were deleted. The remaining
# observation amount is 191. In comparison to the original
# datasets with 211 observations each, this dataset lost 20
# observations. Now the numeric data and categorical data can
# be assessed properly without 'NA' observations
# interferring.
cancercountries <- Categorical %>% inner_join(Numeric)
cancercountries <- cancercountries %>% na.omit()
glimpse(cancercountries)
## Observations: 191
## Variables: 6
## $ `Country or Territory` <chr> "Puerto Rico", "Afgha…
## $ `Most common cancer cases worldwide (females)` <chr> "Breast", "Breast", "…
## $ `Most common cancer cases worldwide (males)` <chr> "Prostate", "Lip, ora…
## $ `Cancer survivors` <dbl> 793.5, 151.0, 312.6, …
## $ `Radiotherapy availability` <dbl> 1.35, 0.00, 0.75, 0.8…
## $ `UICC organizations` <dbl> 1, 2, 13, 1, 0, 2, 0,…
##Rearranging Wide/Long
# The joined dataset was untidied by using pivot_wider. The
# untidied dataset was re-tidied by using pivot_longer. When
# tidying, the NAs were deleted that were previosuly filled
# in by the pivot_wider function.
Untidy <- cancercountries %>% pivot_wider(names_from = `Most common cancer cases worldwide (females)`,
values_from = `Country or Territory`)
Tidy <- Untidy %>% pivot_longer(cols = c(5:9), names_to = "Most common cancer cases worldwide (females)",
values_to = "Country or Territory", values_drop_na = TRUE)
##Wrangling
# To find all of the countries that start with the letter 'U'
# I utilized the filter() and grepl() function. Then, I
# wanted to see out of this group which countries had the
# highest amount of cancer survivors and UICC organizations.
# According to the Cancer Atlas data, the United States has
# the highest number of cancer survivors per 100,000
# individuals. Uzbekistan has the lowest number of cancer
# surivors per 100,000 individuals. In addition, the United
# States has a larger percentage of cancer organizations than
# Ubekistan.
cancercountries %>% filter(grepl("U", `Country or Territory`)) %>%
select(`Country or Territory`, ends_with("s")) %>% arrange(desc(`Cancer survivors`),
desc(`UICC organizations`))
# I wanted to find out how many radiotherapy machines
# potentially helped these cancer patients, so I created a
# new column that gave the number of cancer survivors per
# radiotherapy machine in the country. First, I needed to
# find the number of radiotherapy machines out of 100,000
# cancer patients. The number of cancer survivors was taken
# out of a sample of 100,000 cancer patients. So, I
# mutliplied the radiotherapy variable by 100 to achieve the
# new estimated radiotherapy machines called 'Machines per
# 100,000' under the new dataset 'per100000.' The new column
# 'Number of Cancer Survivors Per Radiotherapy Machine' was
# create using mutate(). I divided the 'Cancer Survivors'
# column by the 'Machines per 100,000 patients' to get my new
# column in the dataset 'newcolumn.'
per100000 <- cancercountries %>% mutate(`Machines per 100,000 patients` = `Radiotherapy availability` *
100)
newcolumn <- per100000 %>% group_by(`Country or Territory`) %>%
mutate(`Number of Cancer Survivors Per Radiotherapy Machine` = `Cancer survivors`/`Machines per 100,000 patients`)
glimpse(newcolumn)
## Observations: 191
## Variables: 8
## Groups: Country or Territory [182]
## $ `Country or Territory` <chr> "Puerto Rico", …
## $ `Most common cancer cases worldwide (females)` <chr> "Breast", "Brea…
## $ `Most common cancer cases worldwide (males)` <chr> "Prostate", "Li…
## $ `Cancer survivors` <dbl> 793.5, 151.0, 3…
## $ `Radiotherapy availability` <dbl> 1.35, 0.00, 0.7…
## $ `UICC organizations` <dbl> 1, 2, 13, 1, 0,…
## $ `Machines per 100,000 patients` <dbl> 135, 0, 75, 86,…
## $ `Number of Cancer Survivors Per Radiotherapy Machine` <dbl> 5.8777778, Inf,…
# Out of all female participants taken around the world,
# Thyroid cancer has the highest number of cancer survivors.
# Cervical cancer and Lung cancer have the lowest number of
# cancer survivors worldwide.
cancercountries %>% group_by(`Most common cancer cases worldwide (females)`) %>%
summarise(`Mean Cancer Survivors` = mean(`Cancer survivors`)) %>%
arrange(desc(`Mean Cancer Survivors`))
# Out of all male participants taken around the world,
# Prostate cancer has the highest number of cancer survivors.
# Non-Hodgkin Lymphoma and Leukemia have the lowest number of
# cancer survivors worldwide.
cancercountries %>% group_by(`Most common cancer cases worldwide (males)`) %>%
summarise(`Mean Cancer Survivors` = mean(`Cancer survivors`)) %>%
arrange(desc(`Mean Cancer Survivors`))
# Summary statistics grouping by most common female cancer
# cases
meannumericfemales <- cancercountries %>% group_by(`Most common cancer cases worldwide (females)`) %>%
summarise(`Mean Cancer Survivors` = mean(`Cancer survivors`),
`Mean Radiotherapy` = mean(`Radiotherapy availability`),
`Mean Cancer Orgs` = mean(`UICC organizations`))
# grouping by most common male cancer cases
meannumericmales <- cancercountries %>% group_by(`Most common cancer cases worldwide (males)`) %>%
summarise(`Mean Cancer Survivors` = mean(`Cancer survivors`),
`Mean Radiotherapy` = mean(`Radiotherapy availability`),
`Mean Cancer Orgs` = mean(`UICC organizations`))
# some stats of all numeric variables
statsnumeric <- cancercountries %>% summarise(`Max Cancer Survivors` = max(`Cancer survivors`),
`Max Radiotherapy` = max(`Radiotherapy availability`), `Max Cancer Orgs` = max(`UICC organizations`),
`Min Cancer Survivors` = min(`Cancer survivors`), `Min Radiotherapy` = min(`Radiotherapy availability`),
`Min Cancer Orgs` = min(`UICC organizations`), `Median Cancer Survivors` = median(`Cancer survivors`),
`Median Radiotherapy` = median(`Radiotherapy availability`),
`Median Cancer Orgs` = median(`UICC organizations`))
# counting the nummber of categorical variables across all
# countries
countfemale <- cancercountries %>% group_by(`Most common cancer cases worldwide (females)`) %>%
count()
countmale <- cancercountries %>% group_by(`Most common cancer cases worldwide (males)`) %>%
count()
# Kable table
library(knitr)
kable(meannumericfemales, format = "markdown", caption = "Mean Female Case Values")
| Most common cancer cases worldwide (females) | Mean Cancer Survivors | Mean Radiotherapy | Mean Cancer Orgs |
|---|---|---|---|
| Breast | 475.7494 | 0.7310759 | 6.145570 |
| Cervix uteri | 203.4467 | 0.1740000 | 2.066667 |
| Liver | 308.7000 | 0.7200000 | 2.000000 |
| Lung | 226.9000 | 0.0700000 | 0.000000 |
| Thyroid | 880.7000 | 0.5800000 | 5.000000 |
kable(meannumericmales, format = "markdown", caption = "Mean Male Case Values")
| Most common cancer cases worldwide (males) | Mean Cancer Survivors | Mean Radiotherapy | Mean Cancer Orgs |
|---|---|---|---|
| Colorectum | 395.9462 | 1.3653846 | 4.384615 |
| Esophagus | 168.3000 | 0.2100000 | 5.000000 |
| Kaposi sarcoma | 218.8875 | 0.0675000 | 1.250000 |
| Leukemia | 155.7000 | 0.0100000 | 1.000000 |
| Lip, oral cavity | 216.4000 | 0.3080000 | 7.200000 |
| Liver | 207.3417 | 0.2641667 | 3.333333 |
| Lung | 389.7425 | 0.6272500 | 3.000000 |
| Non-Hodgkin lymphoma | 148.2333 | 0.1033333 | 1.000000 |
| Prostate | 531.8903 | 0.6696117 | 7.398058 |
| Stomach | 215.3000 | 0.9020000 | 1.200000 |
kable(statsnumeric, format = "markdown", caption = "Numeric Values")
| Max Cancer Survivors | Max Radiotherapy | Max Cancer Orgs | Min Cancer Survivors | Min Radiotherapy | Min Cancer Orgs | Median Cancer Survivors | Median Radiotherapy | Median Cancer Orgs |
|---|---|---|---|---|---|---|---|---|
| 1849.8 | 2.58 | 192 | 74.8 | 0 | 0 | 308.7 | 0.6 | 2 |
kable(countfemale, format = "markdown", caption = "Total Female Cancer Cases")
| Most common cancer cases worldwide (females) | n |
|---|---|
| Breast | 158 |
| Cervix uteri | 30 |
| Liver | 1 |
| Lung | 1 |
| Thyroid | 1 |
kable(countmale, format = "markdown", caption = "Total Male Cancer Cases")
| Most common cancer cases worldwide (males) | n |
|---|---|
| Colorectum | 13 |
| Esophagus | 1 |
| Kaposi sarcoma | 8 |
| Leukemia | 1 |
| Lip, oral cavity | 5 |
| Liver | 12 |
| Lung | 40 |
| Non-Hodgkin lymphoma | 3 |
| Prostate | 103 |
| Stomach | 5 |
# The first summary statistics I wanted to research were the
# mean cancer survivors, radiotherapy availability, and UICC
# organizations of each type of cancer case across females
# and males. To achieve this data, I used the group_by
# function to only search through the 'Most common cancer
# cases worldwide (females).' To get the statistical data,
# the summarize function was used. I found the mean() of each
# numeric variable under each case of cancer. The data was
# displayed through the kable function.The same code was used
# to find similar information for males around the world. In
# females, Thyroid and Breast cancer have some of the highest
# cancer survivor rates, highest number of radiotherapy
# machines available to those patients, and highest number of
# UICC organizations to help those kinds of patients. In
# males, prostate cancer has the highest mean cancer
# survivors, yet not the highest amounts of support to help
# those survivors.
# The next set of summary statistics highlights a general
# overview of the numeric variables. The median, max, and
# minimum of all of the variables are created through the
# summarize function. The kable table function is used to
# display all of these specific values. One of the most
# interesting finds through this table was the statistics on
# Radiotherapy availability. The maximum number of UICC
# organizations is 192, while the median was 2. This showed
# that several countries around the world have a small number
# of UICC organizations. The count functions were also used
# to show the variation among most common cancer cases for
# males and females.
# Correlation matrix Radiotherapy Availability and Cancer
# Survivors have the highest correlation.
cancercountriesmatrix <- cancercountries %>% select_if(is.numeric)
cor(cancercountriesmatrix)
## Cancer survivors Radiotherapy availability
## Cancer survivors 1.0000000 0.4749158
## Radiotherapy availability 0.4749158 1.0000000
## UICC organizations 0.3990374 0.1204279
## UICC organizations
## Cancer survivors 0.3990374
## Radiotherapy availability 0.1204279
## UICC organizations 1.0000000
##Visualizing
# Correlation heatmap of numeric variables As seen above in
# the correlation matrix, the number of cancer survivors and
# radiotherapy availability have the strongest correlation.
# This could be because an increase in radiotherapy machines
# leads to a better chance of survival.
tidycor <- cor(cancercountriesmatrix) %>% as.data.frame %>% rownames_to_column %>%
pivot_longer(-1, names_to = "name", values_to = "correlation")
head(tidycor)
tidycor %>% ggplot(aes(rowname, name, fill = correlation)) +
geom_tile() + geom_text(aes(label = round(correlation, 2)),
color = "black", size = 4) + xlab("") + ylab("") + scale_fill_gradient2(low = "red",
mid = "white", high = "blue") + theme(axis.text.x = element_text(angle = 0,
vjust = 1, size = 6)) + ggtitle("Correlation of Numeric Variables")
# Some cool plots! To visualize the correlation between
# radiotherapy availability and the number cancer survivors,
# I used a scatterplot. The different color dots represent
# the different types of cancers in females around the world.
# As the radiotherapy availability increases, so does the
# number of cancer survivors. This trend seems to be apparent
# for all of the most common cancer cases in females around
# the world. Breast cancer is the most common cancer case
# around the world, so more data points are distributed in
# dark pink.
cancercountries %>% ggplot(aes(`Radiotherapy availability`, `Cancer survivors`,
color = `Most common cancer cases worldwide (females)`)) +
geom_point(size = 2) + ggtitle("Relationship between Radioavailability & Cancer survivors") +
scale_color_brewer(palette = "PiYG") + theme(axis.text = element_text(colour = "dark green")) +
xlim(0, 3)
# This final graph highlights the mean number of cancer
# survivors depending on the type of most common cancer case
# in males. The mean number of cancer survivors was achieved
# through including the 'stat=summary' code within the
# geom_bar function. The different colors show a random pull
# of 12 countries in the dataset. Collectively, from the sum
# of male cancer patients from these countries, prostate
# cancer has the greatest number of cancer survivors. In
# particular, the Netherlands has the largest number of
# cancer survivors. When excluding the Netherlands, all of
# the other countries have similar cancer survivor rates no
# matter the cancer type.
cancercountries %>% slice(88:100) %>% ggplot(aes(x = `Most common cancer cases worldwide (males)`,
y = `Cancer survivors`, fill = `Country or Territory`)) +
geom_bar(stat = "summary", fun.y = "mean", position = "dodge") +
ggtitle("Cancer survivorship for different male cases worldwide") +
scale_fill_hue(l = 80, c = 120) + theme(legend.text = element_text(size = 8))
##Dimensionality Reduction
# estimating the number of clusters
dat2 <- cancercountries %>% select(`Cancer survivors`, `Radiotherapy availability`,
`UICC organizations`) %>% mutate_if(is.character, as.factor)
gower1 <- daisy(dat2, metric = "gower")
pam3 <- pam(gower1, k = 3, diss = T)
sil_width <- vector()
for (i in 2:10) {
pam_fit <- pam(gower1, diss = TRUE, k = i)
sil_width[i] <- pam_fit$silinfo$avg.width
}
ggplot() + geom_line(aes(x = 1:10), y = sil_width) + scale_x_continuous(name = "k",
breaks = 1:10)
pam2 <- cancercountries %>% select(`Cancer survivors`, `Radiotherapy availability`,
`UICC organizations`) %>% pam(2)
pam2
## Medoids:
## ID Cancer survivors Radiotherapy availability UICC organizations
## [1,] 79 775.5 0.73 2
## [2,] 52 238.4 1.19 2
## Clustering vector:
## [1] 1 2 2 2 2 2 2 1 1 1 2 1 2 2 2 2 2 2 2 1 1 2 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1
## [38] 2 2 2 1 2 1 2 2 2 1 1 2 1 2 2 2 2 1 1 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 2 1
## [75] 2 2 2 1 1 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1
## [ reached getOption("max.print") -- omitted 91 entries ]
## Objective function:
## build swap
## 123.2092 111.8676
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
final <- cancercountries %>% mutate(cluster = as.factor(pam2$clustering))
confmat <- final %>% group_by(`Most common cancer cases worldwide (females)`) %>%
count(cluster) %>% arrange(desc(n)) %>% pivot_wider(names_from = "cluster",
values_from = "n", values_fill = list(n = 0))
confmat
round(sum(diag(as.matrix(confmat[, 2:3])))/sum(confmat[, 2:3]),
3)
## [1] 0.518
install.packages("plotly")
##
## The downloaded binary packages are in
## /var/folders/tr/wscl2rq11mg7lp1sk2bksgg00000gn/T//RtmpAvQ9gC/downloaded_packages
library(plotly)
final %>% plot_ly(x = ~`Radiotherapy availability`, y = ~`Cancer survivors`,
z = ~`UICC organizations`, color = ~cluster, type = "scatter3d",
mode = "markers", symbol = ~`Most common cancer cases worldwide (females)`,
symbols = c("circle", "x", "o"))
# To begin my dimensionality reduction, I first found the
# necessary number of clusters to get the best data. To do
# this, I observed the highest silhouette width. The
# silhouette width was displayed on a ggplot line graph.
# Although the line graph does not have large differences
# when changing the number of clusters, 2 clusters has a
# slightly increased silhouette width value. The clusters
# were set by selecting for numeric variables, running an
# accuracy test (accurracy of 51.8%), and plotting the
# clusters. The clusters are graphed by 3 variables:
# Radiotherapy availability, UICC organizations, and Cancer
# surivors.In each cluster, breast cancer patients typically
# had a higher number of cancer survivors. The green cluster
# had a large proportion of breast cancer patients,
# survivors, and radiotherapy availability. The purple
# cluster had a fair amount of cancer survivors and
# radiotherapy machines available. The purple cluster had a
# greater proportion of UICC organizations than the other
# cluster.